Skip to content

Implement OpenTracing-like tracer #47

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 17, 2025
Merged

Implement OpenTracing-like tracer #47

merged 7 commits into from
Apr 17, 2025

Conversation

Abby010
Copy link
Contributor

@Abby010 Abby010 commented Mar 24, 2025

Description

This program visualizes real-time execution flow of distributed spans in a terminal-based UI (TUI) using React Ink. It allows users to toggle between time-based (--sample 1s) and logical event-based (--sample logical) views while displaying structured span hierarchies with box-drawing characters.

Issues Fixed

Tasks

  • 1. We now have a working visualization (Attaching terminal recordings for reference at the end).
  • 2. The core library has been implemented

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

Terminal 1: --sample logical

  • Preview (https://asciinema.org/a/rsIX7rt6teGgXFUyOJsLctsoR)
  • Runs cli.tsx, utilizing React-Ink to display spans in a hierarchical tree structure based on logical parent-child relationships.
  • Organizes spans according to their causal dependencies (e.g., child spans appear under their respective parent spans).
  • Ensures that spans are grouped logically rather than by strict timestamp order, helping to visualize the structured execution flow.

Terminal 2: --sample 1s

  • Preview (https://asciinema.org/a/4u3ZkqPQu1i40ZfzlJ89Pyl91)
  • Runs cli.tsx, utilizing React-Ink to display spans sorted by their start time rather than logical hierarchy.
  • Orders spans sequentially, meaning spans that start earlier appear first, regardless of parent-child relationships.
  • Provides a time-centric view, useful for debugging latency and analyzing performance bottlenecks in real-time execution.

Copy link

linear bot commented Mar 24, 2025

ENG-234

@Abby010 Abby010 requested a review from brynblack March 24, 2025 02:24
@CMCDragonkai
Copy link
Member

Any new visualisations based on lots of lots of data??? @Abby010 and make sure you are copying over and synthesizing the details I have earlier.

@Abby010
Copy link
Contributor Author

Abby010 commented Mar 24, 2025

Any new visualisations based on lots of lots of data??? @Abby010 and make sure you are copying over and synthesizing the details I have earlier.

Not yet, I starting getting some dependency errors so I needed to fix that but I have started working with lots of data.

@Abby010 Abby010 mentioned this pull request Mar 25, 2025
8 tasks
@Abby010
Copy link
Contributor Author

Abby010 commented Mar 25, 2025

New video links have been added with more data in the description

@CMCDragonkai
Copy link
Member

Add comments rather than editing spec for that stuff.

Copy link
Member

@aryanjassal aryanjassal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most changes are just code style changes. As a summary:

  • use undefined for null-ish fields
  • check if a value is undefined or not by doing a == null or != null check (yes, even though we have undefined as the type)
  • use single quotes
  • ensure import formatting
  • for array types, always wrap it in Array<>
  • for potentially undefined fields use ?, unless you are unable to use ? for the parameter type, in which case you always use Type | undefined (as opposed to null)

@CMCDragonkai
Copy link
Member

@Abby010 can you:

  1. Get the CI working
  2. Start integrating this into the js-logger src code under a new class object Tracer.
  3. Plan out integration into Polykey-CLI for testing.
  4. Get @aryanjassal or @tegefaulkes to review to see how it can hook into object lifecycles and node native async contexts.

I think it's important for testing why seed nodes are crashing.

Also why use -+ and not \?

@CMCDragonkai
Copy link
Member

Also you should be assigning yourself.

@CMCDragonkai
Copy link
Member

Start adding tests too.

@CMCDragonkai
Copy link
Member

And benchmarks.

I want a plan as of next cycle.

src/lib/span.ts Outdated
Comment on lines 9 to 17
constructor(name: string, parentSpanId: string | null = null) {
this.spanId = `span-${Date.now()}-${Math.random().toString(36).substr(2, 5)}`;
this.name = name;
this.startTime = Date.now();
this.endTime = null;
this.parentSpanId = parentSpanId;
this.children = [];
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only things that need to be set in the constructor are things that need the be passed through the constructor. Any other property can be defaulted when defined above with public startTime: number = Date.now()

The spanId should be an IdSortable from @matrixai/id library.

@CMCDragonkai
Copy link
Member

Make sure that the span structure can be streamed into an incremental jsonl format. This is what we will be using to then stream visualise on the browser. @shafiqihtsham can weigh in if you talk to him about it.

But I want you to write out a plan for implementation in Polykey-CLI by today.

@Abby010
Copy link
Contributor Author

Abby010 commented Apr 1, 2025

Tracing Integration Plan

Phase 1: Integrate Tracing into js-logger as a First-Class Component

  • Introduce a Tracer class that encapsulates all span-related logic.
  • Replace the standalone openSpan, closeSpan functions with methods on this class.
  • Support structured span creation: startSpan, endSpan, and traced wrappers with context tracking.
  • Provide methods for exporting span data:
    • getActiveSpans()
    • flush()
    • getTraceJSON()

Add Incremental Span Streaming to JSONL

  • Each span, when closed, is written as a JSON object to a .jsonl file (e.g., spans.jsonl).

  • Enable atomic streaming via fs.appendFileSync or streaming writers.

  • Supports real-time debugging and integration with external tools.

  • Update the current TUI visualizer to consume spans from the new Tracer instance.

This phase formalizes span management, embeds JSONL observability, and prepares for broader CLI and system integration.


Phase 2: Integrate Tracer with Polykey CLI and Lifecycle Events

  • Initialize and embed the Tracer into the Polykey CLI runtime.
  • Use Tracer.traced() to instrument top-level CLI commands and subcommands.
  • Extend tracing into object lifecycle tracking: vaults, daemons, agents, etc.
  • Hook into Node.js async context (AsyncLocalStorage) to propagate spans implicitly across async calls and promise chains.
  • Stream span data from real Polykey CLI runs to the .jsonl file.
  • Analyze and visualize spans in the TUI to confirm accurate instrumentation.
  • Use the streamed data for:
    • Automated debugging
    • Replay analysis
    • Bug discovery

This phase validates the integration by capturing real CLI behaviors and preparing for long-term observability and tracing.

@CMCDragonkai
Copy link
Member

Your phase 2 doesn't mention https://github.com/MatrixAI/js-async-init at all. You both need to talk about the implementation plan with @tegefaulkes since I think you're missing some contextual details critical to understanding how tracing can be applied to Polykey's runtime.

@Abby010
Copy link
Contributor Author

Abby010 commented Apr 13, 2025

Phase 2 Plan: Polykey CLI Tracing Integration

Objective

Instrument the Polykey CLI runtime using the Tracer to support structured, memory-safe, and streamable observability of system behavior — with minimal invasiveness and full reversibility.


1. Embed Tracer into Polykey CLI Runtime

  • Inject a Tracer instance during CLI bootstrap.
  • Pass it into relevant components (e.g., Agent, VaultManager) via dependency injection or service context.
  • Ensure span data is optional and tracing is enabled only via flags like --enable-tracing.

2. Hook into Lifecycle Events via js-async-init

Utilize @AsyncInit decorators and structured lifecycle events defined in the js-async-init library.

For async-initialized components like NodeConnectionManager, VaultManager, Daemon, etc., hook into their lifecycle using:

  • EventStart, EventStop
  • EventCreate, EventDestroy

Instrumentation Pattern:

nodeManager.addEventListener(events.NodeConnectionManagerStart.name, () => {
  nodeManager.span = parentSpan.getChild('NodeManagerStart');
});

nodeManager.addEventListener(events.NodeConnectionManagerStop.name, () => {
  nodeManager.span.end();
});

3. Adopt Span Object Model (No Global Tracking)

Replace global tracer calls with direct Span object usage.

Example:

const span = parentSpan.getChild('VaultFetch');
span.event('RPC Request Sent');
span.end();

Track spans via objects, not IDs. Avoid propagating span IDs through business logic.

4. Stream Output Instead of Writing Directly

Refactor Tracer to emit to a writable stream (WritableStream, PassThrough, etc.).

Consumers of the stream (e.g., file writer, stderr, custom logger) decide how to handle the data.

CLI Support:

--trace-output spans.jsonl
--trace-output stderr
--trace-fd=3    # for advanced use

5. Memory Management: No activeSpans, Use WeakRefs

Remove Map<string, Span> used for global span tracking.
Span objects should be ephemeral and garbage collected naturally.

Use WeakRef where necessary for parent-child relationships.
Prevent span graphs from being retained in memory permanently.

6. Implement Observable Tracing Events

Tracer emits observable events:

span:start
span:end

Useful for real-time debugging, TUI enhancements, or plugin systems.

7. Testing Strategy: Use fast-check for Randomized Span Behavior

Replace Math.random() in test harnesses with fast-check property-based fuzzing.

Supports reproducible failures and automated test shrinking for minimal failing cases.

8. Visual Debugging with TUI (as-is)

No changes needed to TUI.

Continues to read from the .jsonl file.

9.Leveraging js-async-init's Lifecycle Contracts

Use lifecycle annotations from js-async-init to instrument long-lived services and ephemeral agents with start/stop or create/destroy lifecycles. The Tracer will hook into lifecycle phases without modifying the business logic directly.

Each lifecycle boundary maps naturally to a span (e.g., VaultManagerStart, DaemonCreate, NodeConnectionDestroy).
This ensures:

  1. Lifecycle spans align with runtime semantics.
  2. Tracing remains declarative and reversible.
  3. No modification to object constructors or internals.

Copy link
Member

CMCDragonkai commented Apr 14, 2025

Some things to review:

  • https://github.com/tc39/proposal-async-context
  • Review what runtimes support it already (node already does)
  • The "span" object should be more implicit, it explore the idea of using WeakMap and symbols for synthetic identifiers.
  • Right now committing to an integration path doesn't make sense, try monkey patching at the Polykey runtime itself, and just output logs and visualising that separately.

@CMCDragonkai
Copy link
Member

Reviewing https://github.com/tc39/proposal-async-context shows that:

  1. https://github.com/tc39/proposal-async-context AsyncContext integration with web APIs w3c/tpac2023-breakouts#39 https://nodejs.org/api/async_context.html all indicate that this feature isn't available widely in all of JS.
  2. If we were to use async context, the logger would be forced to only work in nodejs runtime.
  3. This is fine for debugging PK, but not so much crossing into browsers and we want this library to be generic.
  4. I realise that we should be able to "approximate" this feature by using our decorators, we have js-contexts https://github.com/matrixai/js-contexts that brings together a generic way of creating decorators providing context. It should be possible to use that along with js-async-init to create some sort of more implicit system, that can at first use the node's async context when available, or use something a bit more manual when it is not available (like a combination of weakmap and symbols).

@Abby010 feedback?

@CMCDragonkai
Copy link
Member

image

@CMCDragonkai
Copy link
Member

Key thing about this is that our spans don't necessarily end to be useful.

@Abby010
Copy link
Contributor Author

Abby010 commented Apr 14, 2025

Reviewing https://github.com/tc39/proposal-async-context shows that:

  1. https://github.com/tc39/proposal-async-context AsyncContext integration with web APIs w3c/tpac2023-breakouts#39 https://nodejs.org/api/async_context.html all indicate that this feature isn't available widely in all of JS.
  2. If we were to use async context, the logger would be forced to only work in nodejs runtime.
  3. This is fine for debugging PK, but not so much crossing into browsers and we want this library to be generic.
  4. I realise that we should be able to "approximate" this feature by using our decorators, we have js-contexts https://github.com/matrixai/js-contexts that brings together a generic way of creating decorators providing context. It should be possible to use that along with js-async-init to create some sort of more implicit system, that can at first use the node's async context when available, or use something a bit more manual when it is not available (like a combination of weakmap and symbols).

@Abby010 feedback?

Re: async_context Proposal
I’ve reviewed tc39/proposal-async-context and the Node.js async_hooks API. While these are promising for implicit span propagation, they’re not yet cross-platform.

Agreed for now, relying solely on Node-specific behavior would limit portability. Since js-logger should remain runtime-agnostic, we shouldn’t hard-bind to Node.js-only primitives.

Re: Span Should Be More Implicit
Makes sense. Right now, spans are passed explicitly or manually linked via getChild(). I'm exploring using:

  1. Symbols to tag async flows.
  2. WeakMap to manage span-local state for async resources as suggested
  3. Decorators (js-contexts + js-async-init) to set up span context implicitly without forcing manual tracing at every step.

This would allow something like:

@Traceable
async function syncVault() {
  // span is implicitly available via decorator-generated context
  currentSpan.event('Vault Sync Triggered');
}

Re: Monkey Patching in Polykey Runtime
That aligns with the Phase 2 goal: non-invasive, reversible tracing.
Before embedding tracing into the CLI codebase, We will monkey-patch lifecycle events in runtime modules (like VaultManager, NodeConnectionManager) and observe the emitted span logs via JSONL/TUI.

This will let us test the conceptual model before formalizing APIs or integrations.

Re: js-contexts + js-async-init Hybrid
I’m looking into combining js-async-init lifecycle events with js-contexts to simulate AsyncLocalStorage-like behavior generically.
Idea is to support both:
Native async context via Node.js hooks.
Fallback shim using js-contexts and WeakRefs where native async context isn't available.

Comment on lines +51 to +68
current.flags = fc.sample(flagArb, 1)[0];
current.parentId = openSpan(`Parent-${parentIndex}`);
nestedIds = [];
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of sampling like this, I would recommend doing something like taking the current Unix time and using that as the seed, and increase the value by one every time you want to sample something new. You should also print the seed value to the console. This will ensure that you are using randomised values every time, and if something goes wrong, you can copy-paste the seed to get the same values to investigate why things went wrong. You should fix an error which happens once in a few runs, as that is an edge case you need to consider instead of rerunning the test and ignoring it.

Otherwise you've done pretty good integrating fast-check into this test.

@CMCDragonkai
Copy link
Member

The original plan was for @aryanjassal to start using this in Polykey regardless of the merge. But if that's too difficult. Start merging this in now incrementally and release new ones so that PK can start testing it. This week it has to be done. There's no more time to be allocated to this.

@CMCDragonkai
Copy link
Member

@aryanjassal you now take over this PR to force it to be merged and released.

@CMCDragonkai
Copy link
Member

@aryanjassal start squashing and merge into a patch version, then use it in a new PR in Polykey-CLI.

@aryanjassal aryanjassal changed the title Feature-Tracing-Visualization Implement OpenTracing-like tracer Apr 17, 2025
@aryanjassal aryanjassal removed the request for review from brynblack April 17, 2025 03:16
@aryanjassal
Copy link
Member

Due to not having sufficient time for the handover from abby, there might be some minor issues which might make it to staging. This might see a lot of iteration as well. All the iteration will be done in patch updates which will be used by Polykey CLI.

The js-async-init package will be monkey-patched first to get kind of a demo of the library, and iteration will be made to this library until we have reached a good state to analyse this in-depth.

@aryanjassal
Copy link
Member

I've cleaned up the codebase a little bit, so this can now be merged to staging. I will need to go over some things with abby about the future plans of the tracer, but until then I can start testing out monkey-patching.

It doesn't make much sense to make a PR for this as the change will be in node modules, which is local-only and making a PR will not make a difference. I will however make a branch for this and push up a PR if any significant changes are made in there.

Abby010 added 7 commits April 17, 2025 16:12
fix: logical mode

fix: time based mode

chore: removed unnecessary comments

fix: package-lock.json

chore: deleted unnecessary asciinema recordings

fix: package.json
chore: addressed PR comments
chore: cleaning up codebase

chore: removed json files

fix: lint
@aryanjassal aryanjassal marked this pull request as ready for review April 17, 2025 06:17
@aryanjassal aryanjassal merged commit b8a2040 into staging Apr 17, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Integrate Tracing (derived from OpenTelemetry)
4 participants